1. Introduction

Background

NMBS/SNCB trains, which is the biggest railway serving the country of Belgium, needs to postpone the expansion because of lack of train personnel staff. Also, the raise on energy cost increase the operating cost. Meanwhile, the company still face revenue loss because of low passengers on some routes.

With these problems, the company can strategically plan their policies on optimizing resource allocation, enhancing operational cost efficiency, and also increasing revenue especially on their low occupancy routes. But how can they do that?

Use Case

This project is aim to predict occupancy levels (categorized by high and low) for different Origin-Destination (OD) pairs of Belgium railway system to allocate resources such as train frequency and size effectively to help transportation planners of NMBS/SNCB or the Transportation Department in Belgium.

e.g. extend/merge train routes, minimize vacant trains or mitigate crowdedness of trains etc.

Question:

What is the level of occupancy of a certain OD pair at a specific time?

What options do planners have to manage this demand?

e.g. decrease amount of trains, merge of two similar routes, or decrease the frequent of the route to change low routes to medium, and of increase amount of trains, separate two similar routes, or increase the frequent of the route to change high routes to medium.

We build this web-based application, called Re-Train that predict the occupancy rate across the routes with three stages, first analysing and visualizing past data, creating long-term or short-term predictions based on the users’ need, and last, showcasing potential benefits and risk on the predictions.

This is the interface which shows the latest occupancy rate where orange shows low occupancy routes. First feature is History trend analysis. On the top-right corner, user can filter based on those variables and click analysis, to see graph and maps of the train occupancy level in the past.

Second feature is prediction. you can predict by filtering like the first one, and choose the period time based on your needs. After clicking the predict, the predicted occupancy rate will result in the percentage of each level and be shown in the map, and also the accuracy percentage & graph below the map.

Last feature, you can check on the cost benefit analysis based on the scenario of the prediction which will show you the potential benefit & potential risk of optimizing routes on those period of time. Then, you can export the report to help you make decision based on that.

How it works?

We use Kaggle train occupancy level data on 2016 which have these variables that potentially correlate with the occupancy rate. And add some external datas like weather, demographic, and fare, and then predict using logistic regression.

From this, we can see high-demand OD pairs are consistent, while low-demand OD pairs keep on changing. High occupancy is concentrated on major intercity and commuter routes. Meanwhile, smaller or regional routes shows reduced demand. But there are two quite big OD Pairs that dominate low-occupancy, possibly indicating a mismatch in train frequency with actual demand.

The modeling approach uses binomial regression to predict train occupancy (high or low) while identifying key factors that are validated through k-fold cross-validation for robustness. Also, a confusion matrix will assess the model’s predictive accuracy and reliability which is shown in the app through accuracy percentage and graph.

To minimize the risks of misclassifying high-occupancy routes as low, we set a higher threshold to ensure only routes with a strong likelihood of low occupancy are chosen just like shown on the animation example, It will prioritize service reliability and passenger satisfaction.

By correctly predicting low-occupancy routes, we can achieve significant cost savings in maintenance, labor, and energy while avoiding wasted resources. However, misclassifying high-occupancy as low can lead to revenue loss, overcrowding, and passenger dissatisfaction. Striking the right balance in predictions ensures optimized resource allocation and supports a sustainable and efficient rail network.

2. Data Setup

Load Libraries

We loaded libraries that are required to analyze and visualize the data as shown in the code below.

knitr::opts_chunk$set(echo = TRUE)
#install.packages("hms")
#install.packages("BelgiumStatistics", repos = "http://www.datatailor.be/rcube", type = "source")
#install.packages("devtools")
#library(devtools)
#devtools::install_github("jwijffels/BelgiumMaps.Admin", build_vignettes = TRUE)
#devtools::install_github("jwijffels/StatisticsBelgium", build_vignettes = TRUE)
#library(BelgiumStatistics)
library(tidyverse)
library(tidycensus)
library(sf)
library(knitr)
library(kableExtra)
library(caret)
library(pscl)
library(mapview)
library(dplyr)
library(scales)
library(viridis)
library(spdep)
library(caret)
library(plotROC)
library(pROC)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(jtools)     # for regression model plots
library(broom)
library(tufte)
library(rmarkdown)
library(pander)
library(classInt)
library(ggplot2)
library(units)
library(leaflet)
library(lubridate)
library(hms)
library(riem) # for weather
library(readxl) # for read xlsx
library(ggtext)
library(showtext) # to set up font in plots
font_add_google("Roboto", "roboto") # to set up font in plots
showtext_auto(TRUE)
library(geosphere) # to calculate distance from two different geolocation (lat, lng)

source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
colorPallete <- c("low" = "#df543b", "medium" = "#a1b7b8", "high" = "#a1b7b8")
plotTheme <- theme(
  plot.title = element_text(face="bold", hjust = 0, size=15, lineheight=0.8),
  plot.subtitle = element_text(hjust = 0, size = 10, face = "italic", lineheight=0.8, margin = margin(b = 3, t = 6)),  
        plot.caption = element_text(size = 10, hjust = 0, lineheight=0.9),
        plot.margin = margin(1.7, 1.7, 1.7, 1.7),
        text = element_text(family = "roboto"),
        legend.title = element_text(size = 10))

Load Data

We loaded three default data (1)-(3) of NMBS trains, stations, and lines from Kaggle, and four additional data (4)-(7) of Belgium statistical tracts, Belgium population data, zone station data, and weather data from RIEM package. [1]

  1. Trains dataset (trains, trains_test, trains_train, trains_trainTest, trains_trarinTrain)

The training dataset of trains provides date, time, origin station code (from), destination station code (to), vehicle ID, and occupancy information shown at three different levels (low, medium, high), from July 27, 2016 to October 29, 2016. The test dataset of trains have the same columns as the training dataset, but from October 29, 2016 to December 19, 2016. The test dataset does not have occupancy level information.

  1. Stations dataset (stations)

The stations dataset contains names, locations (longitude and latitude), and average stop times (avg_stop_times). Since NMBS operates trains in Belgium as well as France and the Netherlands, we filtered the dataset to exclude non-Belgian stations (e.g. those in the Netherlands, France, Luxembourg etc.). Additionally, we created columns to calculate the distance between origin and destination stations, the distance from Belgium to the origin and destination stations, and the bearing between origin and destination stations.

  1. Lines dataset (line_info)

The lines dataset includes information about the vehicle ID, vehicle type, the number of stops, and stopping station IDs. Since the information in the train dataset is insufficient, we did not use stopping station IDs in this analysis.

  1. Belgium statistical tracts dataset (statBel and statBel_muni)

The Belgium statistical tracts dataset contains the geometries of each tract. We gained this dataset from Statbel, the Belgian statistical office. We joined this dataset to calculate the population of Belgium by municipality and intersect this dataset to the stations and the weather stations.

  1. Belgium population dataset (popBel)

We assumed that the population of the origin and destination regions would affect the occupancy level of the trains. We used the Belgium population dataset to calculate the population of Belgium by municipality and joined this dataset to the statistical tracts dataset.

  1. Zone station dataset (stationZone)

The zone station dataset contains the zone information of each train station. At first, we assumed the fare information may be one of the independent variable so we added this information to the stations dataset to calculate the fare. However, we could not find the opened fare information of NMBS, so we did not use this information in the analysis.

  1. Weather dataset (weather_belgium, weatherstations_with_muni)

We assumed that the weather conditions of the origin and destination regions at specific dates and times would affect the occupancy level of the trains. The dataset is from RIEM, which is developed by Iowa Environment Mesonet. Since RIEM provides global coverage, we were able to get temperature, wind, and relative humidity data from 12 weather stations located in Belgium airports. Since the percipitation data of Belgium was unavailable in RIEM, we used relative humidity as a substitute variable to reflect general patterns related to precipitation.

# Load (1) trains, (2) stations, and (3) SNCB lines data
line_info <- read.csv("./Data/line_info.csv") 
trains_test <- read.csv("./Data/trains_test.csv") 
trains_train <- read.csv("./Data/trains_train.csv") 
stations <- read.csv("./Data/stations.csv")

# Load (4) Belgium statistical tracts and (5) population data
statBel <- st_read("./Data/sh_statbel_statistical_sectors_3812_20240101.geojson") %>%
  st_transform(4326) 
popBel <- read_excel("./Data/TF_POP_STRUCT_SECTORS_2016.xlsx") # 2016 Belgium population data by tracts

# Load Zone station data to calculate fare
stationZone <- read_excel("./Data/station-zone.xlsx") 

# Filter stations data to exclude non-Belgian stations (Netherlands, France, etc.)
stations <- stations %>%
  filter(country.code=="be")
stations <- left_join(stations, stationZone, by = c("name" = "Station")) # Add zone information to stations)

# Stat tracts and population of Belgium by municipality
statBel_muni <- statBel %>%
  group_by(cd_dstr_refnis) %>%
  summarize(geometry = st_union(geometry)) %>%
  ungroup() 

statBel_muni <- statBel_muni %>%
  left_join(
    statBel %>% st_drop_geometry() %>% select(cd_dstr_refnis, tx_adm_dstr_descr_fr) %>% distinct(),
    by = "cd_dstr_refnis"
  )

popBel_muni <- popBel %>%
  mutate(CD_REFNIS_N = as.character(as.numeric(substr(CD_REFNIS, 1, 2)) * 1000)) %>%
  select(CD_REFNIS_N, POPULATION, TX_DESCR_FR) %>%
  group_by(CD_REFNIS_N) %>%
  summarize(POPULATION = sum(POPULATION)) 

statBel_muni <- left_join(statBel_muni, popBel_muni, by = c("cd_dstr_refnis" = "CD_REFNIS_N"))

# Clean up the URI column to extract the station IDs
stations <- stations %>%
  mutate(station_id = gsub("http://irail.be/stations/NMBS/", "", URI))

# Convert stations dataset to sf object
stations_sf <- stations %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326)

Data Wrangling

We transformed raw data into a more usable format for analysis through the following processs. First, we combined the training and test datasets, which were initially separated, to manipulate the entire dataset at once. After we finished the data wrangling, we split them back into training and test sets.

Second, we recoded occupancy data from three levels (low, medium, high) to two levels (low, high). This was done to focus more on analyzing trends in low occupancy data, and to simplify regression analysis process by using a binomial model instead of a multinomial one.

[1]

  1. Low: There are plenty of seats left.

  2. Medium: It is hard to find a seat and it is difficult to sit together.

  3. High: There are no seats left and people have to stand up

  1. Low: There are plenty of seats left.

  2. High: It is hard to find a seat and it is difficult to sit together or there are no seats left and people have to stand up.

Lastly, we created a panel dataset called trains by performing a left_join of internal and external datasets. This incorporated both dependent and independent variables of the following into one table. The detailed is as shown in Table 1.

  • Primary internal data: datetime (date and time, dbl), week (week number, chr), dotw (day of week), interval60 (hourly interval), from_x (origin station information, e.g. name, latitude, longitude etc.), to_x (destination station information, e.g. name, latitude, longitude etc.), vehicle (train number), occupancy_original (occupancy of three levels), occupancy (occupancy of two levels), avg_stop_times (average stop times of each line), n_stops (number of stops of each line)

  • Primary external data:

  1. Administration sector: cd_dstr_refnis (borough number), tx_adm_dstr_descr_fr (name of borough that origin and destination stations are located)

  2. Population: POPULATION (population of borough that origin and destination stations are located)

  3. Weather: temp, humid, wind (temperature, humidity, and wind speed of each hour of day at origin and destination stations), nearest_weather_station (the nearest weather station location of origin and destination stations)

  4. Event: event (weekend and holiday information) Event: 0 = Weekday, 1 = Weekend or holiday

Table 1. Variable Type and Description
Variable Type Description
from/to <chr> 9-digit ID of each origin and destination station
vehicle <chr> ID of train
occupancy_original <chr> Train occupancy of three levels (low, medium, high)
occupancy <chr> Train occupancy of two levels (low, high)
interval60 <dttm> Hourly interval
week <dbl> Week number
datetime <dttm> Date and time
dotw <ord> Day of week
from/to_station <chr> Name of each origin and destination station
from/to_avg_stop_times <dbl> The average number of vehicles stopping each day in this station
from/to_lng <dbl> Longitude of each origin and destination station
from/to_lat <dbl> Latitude of each origin and destination station
from/to_pop <dbl> Population of each origin and destination station
from/to_cd_dstr_refnis <chr> 5-digit borough number that each origin and destination station is located
from/to_tx_adm_dstr_descr_fr <chr> Name of a borough that each origin and destination station is located
from/to_nearest_weatherstation <chr> 4-letter code of the nearest weather station from each origin and destination train station
from/to_temp <dbl> Temperature of each origin and destination train station in Celsius
from/to_humid <dbl> Relative humidity of each origin and destination train station in percentage
from/to_wind <dbl> Wind of each origin and destination train station in meters per second
vehicle_type <chr> 4-letter code of vehicle type
nr_of_stops <int> Number of stops of this O-D pair
event <dbl> Weekend and holiday information (0 = Weekday, 1 = Weekend or holiday)
n_high <int> Frequency of high occupancy of this O-D pair during Jul-Oct 2016
n_low <int> Frequency of low occupancy of this O-D pair during Jul-Oct 2016
dist_from_O_to_Brussels <dbl> Distance from an origin station to Brussels in kilometers
dist_from_D_to_Brussels <dbl> Distance from a destination station to Brussels in kilometers
dist_from_O_to_D <dbl> Distance between an origin and a destination station in kilometers
bearing_from_O_to_D <dbl> Bearing angle from an origin to a destination station
bearing_from_O_to_D_cat <chr> Categorized bearing angle (N, S, E, W, NE, NW, SE, SW)
time_cat <fct> Categorized time by commute, morning, noon, evening, midnight
date_numeric <dbl> Numerized date
time_numeric <dbl> Numerized time
occupancy_numeric <dbl> Numerized occupancy (0 = high, 1 = low)
# Change station code from numeric to character and rbind train and test sets
trains_test$from <- as.character(trains_test$from)
trains_test$from <- paste0("00", trains_test$from) 
trains_test$to <- as.character(trains_test$to)
trains_test$to <- paste0("00", trains_test$to) 
trains_test$occupancy <- NA

# Rbind train and test sets
trains <- rbind(trains_train, trains_test)
#trains$occupancy <- factor(trains$occupancy, levels = c("low", "medium", "high"))
# Believe this causes regression error

# Convert date and time to POSIX datetime
trains$datetime <- as.POSIXct(paste(trains$date, trains$time), format = "%Y-%m-%d %I:%M:%S %p")
trains$week <- week(trains$datetime)
trains$dotw <- wday(trains$datetime, label = TRUE)

# Update vehicle ID
trains <- trains %>%
  mutate(vehicle = case_when(
    grepl("^\\d+$", vehicle) & vehicle %in% c("7006", "7966", "8011", "8015") ~ paste0("P", vehicle),
    grepl("^\\d+$", vehicle) ~ paste0("IC", vehicle),
    TRUE ~ vehicle
  ))

# Recode Low-medium-high to Low-high
trains$occupancy_original <- trains$occupancy
trains <- trains %>%
  mutate(occupancy = recode(occupancy, "medium" = "high")) %>%  # (1) Recode medium to high
#  mutate(occupancy = ifelse(occupancy == "medium", NA_character_, occupancy)) %>%
  # (2) Remove medium to high
  mutate(interval60=ymd_h(substr(datetime, 1, 13))) %>%
  mutate(week=week(interval60)) 

Population and Statistical tract data

As of 2016, the population of Belgium is 11,267,910. In Belgium, 43 municipalities (CD_DSTR_REFNIS), 589 Reference from National Institute of Statistics Code (CD_REFNIS), and 2,882 statistical sectors (CD_SECTOR). We performed a left join to add the municipality where the origin and destination stations are located and their population, to the trains dataset.

statBel_muni <- statBel_muni %>%
  mutate(tx_adm_dstr_descr_fr_appr = str_extract(tx_adm_dstr_descr_fr, 
                                  "(?<=Arrondissement d’|Arrondissement de )\\b[\\wÀ-ÿ\\s-]+\\b")) 

ggplot()+
  geom_sf(data = statBel_muni, aes(fill = POPULATION), color = "white")+
  geom_sf_text(
    data = statBel_muni, 
    aes(label = tx_adm_dstr_descr_fr_appr),  
    size = 3,  
    color = "white"  
  ) +
    labs(
    title = "Population of Belgium by Municipality",
    subtitle = "2016",
    caption="Source: StatBel"
  ) +
  theme_void()+plotTheme+
  theme(legend.position = "bottom")
## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not
## give correct results for longitude/latitude data

stations_with_pop <- st_join(stations_sf, statBel_muni) %>%
  st_drop_geometry()
stations_with_pop <- stations_with_pop %>%
  bind_cols(as.data.frame(st_coordinates(stations_sf))) %>%
  rename(longitude = X, latitude = Y)

# Add from and to station names and left join population into trains data
trains <- trains %>%
  left_join(stations_with_pop, by = c("from" = "station_id")) 
## Warning in left_join(., stations_with_pop, by = c(from = "station_id")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 87 of `x` matches multiple rows in `y`.
## ℹ Row 92 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
trains <- trains %>%
  select(date, time, from, to, vehicle, occupancy_original, occupancy, interval60, week, datetime, dotw, name, avg_stop_times, longitude, latitude, cd_dstr_refnis, tx_adm_dstr_descr_fr, POPULATION) %>%
  rename(from_station = name) %>%
  rename(from_avg_stop_times = avg_stop_times) %>%
  rename(from_lng = longitude) %>%
  rename(from_lat = latitude) %>%
  rename(from_pop = POPULATION) %>%
  rename(from_cd_dstr_refnis = cd_dstr_refnis) %>%
  rename(from_tx_adm_dstr_descr_fr = tx_adm_dstr_descr_fr)

trains <- trains %>%
  left_join(stations_with_pop, by = c("to" = "station_id")) 
## Warning in left_join(., stations_with_pop, by = c(to = "station_id")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 86 of `x` matches multiple rows in `y`.
## ℹ Row 92 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
trains <- trains %>%
  select(date, time, from, to, vehicle, occupancy_original, occupancy, interval60, week, datetime, dotw, from_station, from_avg_stop_times, from_lng, from_lat, from_pop, from_cd_dstr_refnis, from_tx_adm_dstr_descr_fr, name, avg_stop_times,  longitude, latitude, cd_dstr_refnis,tx_adm_dstr_descr_fr, POPULATION) %>%
  rename(to_station = name) %>%
  rename(to_avg_stop_times = avg_stop_times) %>%
  rename(to_lng = longitude) %>%
  rename(to_lat = latitude) %>%
  rename(to_pop = POPULATION) %>%
  rename(to_cd_dstr_refnis = cd_dstr_refnis)%>%
  rename(to_tx_adm_dstr_descr_fr = tx_adm_dstr_descr_fr)

Weather data

# Load weather data
weatherstations_sf <- riem_stations(network = "BE__ASOS") %>% # BE__ASOS = Belgium ASOS (Weather Stations)
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326)
weatherstations <- weatherstations_sf %>% 
  pull(id)
weather_belgium <- data.frame()

for (station in weatherstations) {        # Loop through each station
  weather_data <- riem_measures(          # Fetch data for the current station
    station = station,
    date_start = "2016-07-27",
    date_end = "2016-12-19"
  ) %>%
    select(valid, tmpf, relh, sknt) %>%   # Select relevant columns (timestamp, temperature by Fahrenheit, relative humidity in percentage, wind speed in knots)
    mutate(station_id = station)          # Add a column to identify the station
  weather_belgium <- bind_rows(weather_belgium, weather_data) # Append the fetched data to the main data frame
}

weather_belgium <- weather_belgium %>%
  mutate(interval60 = ymd_h(substr(valid, 1, 13))) %>% # Extract the timestamp and convert it to a POSIXct object
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  group_by(interval60, station_id) %>%
  summarize(Temp = (max(tmpf, na.rm = TRUE)-32)*5/9, # Temperature by Celsius
            Humid = max(relh, na.rm = TRUE), # relative humidity in percentage
            Wind = max(sknt, na.rm = TRUE)*0.51444) # Wind speed in meters per second
## Warning: There were 337 warnings in `summarize()`.
## The first warning was:
## ℹ In argument: `Temp = (max(tmpf, na.rm = TRUE) - 32) * 5/9`.
## ℹ In group 6078: `interval60 = 2016-08-18 10:00:00` and `station_id = "EBBE"`.
## Caused by warning in `max()`:
## ! no non-missing arguments to max; returning -Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 336 remaining warnings.
## `summarise()` has grouped output by 'interval60'. You can override using the
## `.groups` argument.
# Convert weather dataset to sf object
weatherstations_with_muni <- st_join(weatherstations_sf, statBel_muni) %>%
  st_drop_geometry()

# Find nearest weather station for municipalities without a station
statBel_muni <- statBel_muni %>%
  mutate(
    weatherstation_within = st_intersects(geometry, weatherstations_sf, sparse = FALSE) %>%
      apply(1, function(x) if (any(x)) weatherstations_sf$id[which(x)[1]] else NA)  # Replace `id` with the column identifying stations
  )
statBel_muni <- statBel_muni %>%
  mutate(
    nearest_weatherstation = ifelse(
      is.na(weatherstation_within),
      weatherstations_sf$id[st_nearest_feature(geometry, weatherstations_sf)],  # Replace `id` with station identifier
      weatherstation_within
    )
  )
statBel_muni <- statBel_muni %>% select(-weatherstation_within)

# left join weather station data into trains data
trains <- trains %>%
  left_join(statBel_muni, by=c("from_cd_dstr_refnis" = "cd_dstr_refnis")) %>%
  rename(from_nearest_weatherstation = nearest_weatherstation) %>%
  select(-geometry, -POPULATION)

trains <- trains %>%
  left_join(statBel_muni, by=c("to_cd_dstr_refnis" = "cd_dstr_refnis")) %>%
  rename(to_nearest_weatherstation = nearest_weatherstation) %>%
  select(-geometry, -POPULATION)

# left join weather data into trains data
trains <- trains %>%
  left_join(weather_belgium, by = c("interval60" = "interval60", "from_nearest_weatherstation" = "station_id")) %>%
  mutate(from_temp = Temp) %>%
  mutate(from_humid = Humid) %>%
  mutate(from_wind = Wind) %>%
  select(-Temp, -Humid, -Wind)  

trains <- trains %>%
  left_join(weather_belgium, by = c("interval60" = "interval60", "to_nearest_weatherstation" = "station_id")) %>%
  mutate(to_temp = Temp) %>%
  mutate(to_humid = Humid) %>%
  mutate(to_wind = Wind) %>%
  select(-Temp, -Humid, -Wind) 

trains <- trains %>%
  mutate(from_humid = case_when(is.infinite(from_humid) ~ NA_real_,
            TRUE ~ from_humid)) %>%
  mutate(to_humid = case_when(is.infinite(to_humid) ~ NA_real_,
            TRUE ~ to_humid))
ggplot()+
  geom_sf(data=statBel_muni , fill="#a1b7b8", color="#333")+
  geom_sf(data=weatherstations_sf, color="#df543b", size=1)+
  geom_sf_text(data = statBel_muni ,
               aes(label = cd_dstr_refnis),  
               size = 3, color = "black") +
  labs(
    title = "Weather Station Location in Belgium",
        subtitle = "Weather data from RIEM package",
    caption = "Source: StatBel, RIEM package by Maelle Salmon"
  ) +
  theme_void()+plotTheme

grid.arrange(top = "Weather DATA - Belgium - Jul - Dec 2016",
              ggplot(weather_belgium, aes(interval60, Humid)) + geom_line(aes(color=station_id))+plotTheme,
              ggplot(weather_belgium, aes(interval60, Wind)) + geom_line(aes(color=station_id))+plotTheme,
              ggplot(weather_belgium, aes(interval60, Temp)) + geom_line(aes(color=station_id))+plotTheme)

Vehicle data

Train types in Belgium is as the following.[2] In this analysis, we removed “EUR”, “ICE”, “ICT”, “TGV”, “TRN”, and other undefined vehicle code.

InterCity (IC): IC trains connect Belgium’s large cities. These trains only stop at the biggest train stations and sometimes cross international borders.

Peak (P): P trains run during peak travel times. They provide additional alternatives when you’re travelling during busy periods. Most of these trains run in the mornings and in late afternoon.

Local trains (L): L trains generally connect cities, but they also stop at every station along the route.

S-trains (S): S trains are suburban trains that connect the city centre with the surrounding communes. S trains stop in most stations along the route.

EXTRA: Additional train services, used at exceptionally busy periods. For example, these are the trains that travel towards the Belgian coast on very sunny days.

T (Touristic): Additional train service to certain tourist destinations.

EXP (Coast Express): Extra train service to the Belgian coast during the summer period. International trains (INT = EC, THA, TGV, ICE, EST): Regular international trains, namely Eurocity, Thalys, TGV, ICE and Eurostar.

# Left join line_data into trains data
trains <- trains %>%
   left_join(line_info, by = c("vehicle" = "vehicle_id")) 

# Clean up vehicle type
trains <- trains %>%
  mutate(
    vehicle_type = ifelse(
      is.na(vehicle_type) | vehicle_type == "",  
      gsub("[0-9]", "", vehicle),           
      vehicle_type                          
    )
  )
trains <- trains %>%
  mutate(vehicle_type = case_when(
    vehicle_type == "ic" & row_number() == min(which(vehicle_type == "ic")) ~ "IC",
    TRUE ~ vehicle_type
  ))
trains <- trains %>%
  mutate(vehicle = case_when(
    vehicle == "ic2029" & row_number() == min(which(vehicle == "ic2029")) ~ "IC2029",
    TRUE ~ vehicle
  ))

# Filter few vehicle_type
trains <- trains %>%
  filter(!(vehicle_type %in% c("EUR", "ICE", "ICT", "TGV", "TRN", "undefined", "(null)")))

Event data

# Create event data for weekends and holidays
trains <- trains %>%
  mutate(event = case_when(dotw == "Sat" ~ 1,
                           dotw == "Sun" ~ 1,
                           ymd(interval60) >= "2016-07-28" & ymd(interval60) <= "2016-09-04" ~ 1,
                           ymd(interval60) == "2016-11-03" | ymd(interval60) == "2016-11-04" | ymd(interval60) == "2016-11-05" ~ 1,
                           ymd(interval60) == "2016-11-13" | ymd(interval60) == "2016-11-14" | ymd(interval60) == "2016-11-15" ~ 1,
                           TRUE ~ 0))
## Warning: There were 8 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `event = case_when(...)`.
## Caused by warning:
## !  3732 failed to parse.
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 7 remaining warnings.

Distance and Bearing Angle from Origin to Destination

# Calculate the distance from O to Brussels, from D to Brussels, and from O to D in kilometers
trains <- trains %>%
  mutate(
    dist_from_O_to_Brussels = pmap_dbl(
      list(from_lng, from_lat), 
      ~ distHaversine(c(4.3517, 50.8503), c(..1, ..2)) / 1000
    ),
    dist_from_D_to_Brussels = pmap_dbl(
      list(to_lng, to_lat),
      ~ distHaversine(c(4.3517, 50.8503), c(..1, ..2)) / 1000
    ),
    dist_from_O_to_D = pmap_dbl(
      list(from_lng, from_lat, to_lng, to_lat),
      ~ distHaversine(c(..1, ..2), c(..3, ..4)) / 1000
    )
  )

# Calculate the bearing angle from O to D
deg2rad <- function(deg) {
  return(deg * pi / 180)
}
rad2deg <- function(rad) {
  return(rad * 180 / pi)
}
calculate_bearing <- function(lat1, lon1, lat2, lon2) {
  phi1 <- deg2rad(lat1)
  phi2 <- deg2rad(lat2)
  delta_lambda <- deg2rad(lon2 - lon1)
  
  theta <- atan2(
    sin(delta_lambda) * cos(phi2),
    cos(phi1) * sin(phi2) - sin(phi1) * cos(phi2) * cos(delta_lambda)
  )
  
  bearing <- (rad2deg(theta) + 360) %% 360
  return(bearing)
}

trains$bearing_from_O_to_D <- calculate_bearing(trains$from_lat, trains$from_lng, trains$to_lat, trains$to_lng)

trains$bearing_from_O_to_D_cat <- case_when(
  trains$bearing_from_O_to_D >= 337.5 | trains$bearing_from_O_to_D < 22.5 ~ "N",
  trains$bearing_from_O_to_D >= 22.5 & trains$bearing_from_O_to_D < 67.5 ~ "NE",
  trains$bearing_from_O_to_D >= 67.5 & trains$bearing_from_O_to_D < 112.5 ~ "E",
  trains$bearing_from_O_to_D >= 112.5 & trains$bearing_from_O_to_D < 157.5 ~ "SE",
  trains$bearing_from_O_to_D >= 157.5 & trains$bearing_from_O_to_D < 202.5 ~ "S",
  trains$bearing_from_O_to_D >= 202.5 & trains$bearing_from_O_to_D < 247.5 ~ "SW",
  trains$bearing_from_O_to_D >= 247.5 & trains$bearing_from_O_to_D < 292.5 ~ "W",
  trains$bearing_from_O_to_D >= 292.5 & trains$bearing_from_O_to_D < 337.5 ~ "NW"
)

Time categorization

The following plot shows a distribution between high and low occupancy levels by hour of the day. The train schedule was concentrated at 06:00-08:59 and 16:00-19:59. This result corresponds to the typical rush hour in Belgium, which is 08:00-17:00. Despite the large number of trains in this time zone, the low occupancy rate was low at around 33%. However, the low occupancy rate was about 62% at 10:00-14:59.

# Categorize time
time_category <- function(time) {
  hour <- as.numeric(format(time, "%H"))
  if ((hour >= 6 & hour < 9) | (hour >= 16 & hour < 20)) {
    return("commute time")
  } else if (hour >= 0 & hour < 6) {
    return("morning")
  } else if (hour >= 9 & hour < 16) {
    return("noon")
  } else {
    return("evening")
  }
}

trains <- trains %>%
  mutate(time_cat = sapply(datetime, time_category))
trains$time_cat <- factor(trains$time_cat, levels = c( "morning", "commute time", "noon", "evening"))

3. Exploratory Data Analysis

Location of Stations

The map below shows the location of 567 train stations in Belgium. The colored stations are primary stations that represents each zone. SNCB operates trains in Belgium as well as France and the Netherlands, but we filtered the dataset to exclude non-Belgian stations (e.g. those in the Netherlands, France, Luxembourg etc.) in this analysis.

# Create a leaflet map
pal <- colorFactor(
  palette = "Set1",  
  domain = stations_sf$Zone  
)

leaflet(stations_sf) %>%
  addTiles() %>%
  addCircleMarkers(
    label = ~name,
    radius = 2,
    color = ~pal(Zone),
    fillOpacity = 0.9
  ) %>%
  addLegend("bottomright", pal=pal, values=~Zone, labels = "Stations", title = "Station Map")

Distribution of Train Occupancy Levels

Trains data from July to October 2016 shows that more than 40% of trains had low occupancy as shown below.

# Occupancy distribution
trains %>%
  subset(!is.na(occupancy)) %>%
  count(occupancy) %>%
  ggplot(aes(x = occupancy, y = n, fill = occupancy)) +
  geom_bar(stat = "identity") +
  geom_text(
    aes(label = paste0(n, " (", round(n / sum(n) * 100, 1), "%)")), 
    vjust = -0.5  
  ) +
  scale_fill_manual(values = colorPallete) +
  labs(
    title = "Distribution of Train Occupancy Levels",
    subtitle = "Training data, July - Oct 2016",
    x = "Occupancy Level",
    y = "Count of Trains"
  ) +
  theme_minimal()+plotTheme+
  theme(legend.position = "none")

Distribution of Train Occupancy Levels by Time Trend

(1) Train Occupancy Trend by Hour of the Day

The following plot shows a distribution between high and low occupancy levels by hour of the day. The train schedule was concentrated at 06:00-08:59 and 16:00-19:59. This result corresponds to the typical rush hour in Belgium, which is 08:00-17:00. Despite the large number of trains in this time zone, the low occupancy rate was low at around 33%. However, the low occupancy rate was about 62% at 10:00-14:59.

# Time trend
trains %>%
  subset(!is.na(occupancy)) %>%
  group_by(hour = hour(datetime), occupancy) %>%
  count()  %>%
  ggplot(aes(x = factor(hour), y = n, fill = occupancy)) +  
  geom_bar(stat = "identity") +  
  geom_text(
    aes(label = paste0(n)),
    position = position_stack(vjust = 0.5)
  ) +
  scale_fill_manual(values = colorPallete) +  
  labs(
    title = "Train Occupancy Trend by Hour of the Day",
    subtitle = "Training data, July - Oct 2016",
    x = "Hour of Day",
    y = "Count of Trains",
    fill = "Occupancy Level"
  ) +
  theme_minimal()+plotTheme+
  theme(legend.position = "bottom")

sum_trains <- trains %>%
  subset(!is.na(occupancy)) %>%
  group_by(hour = hour(datetime), occupancy) %>%
  count() %>%
  pivot_wider(names_from = occupancy, values_from = n, values_fill = 0) %>%  
  mutate(
    ratio = low / (low + high),  
    total = low + high           
  )
trains %>%
  subset(!is.na(occupancy)) %>%
  group_by(hour = hour(datetime)) %>%
  count(occupancy) %>%
  ggplot(aes(x = hour, y = n, color = occupancy)) +
  geom_line() +
  scale_color_manual(values = colorPallete) +
  labs(
    title = "Train Occupancy Trend by Hour of the Day",
        subtitle = "Training data, July - Oct 2016",
    x = "Hour of Day",
    y = "Count"
  ) +
  theme_minimal()+plotTheme+
  theme(legend.position = "bottom")

(2) Train Occupancy Trend by Day of Week

The proportion of low occupancy routes on weekdays is lower than on weekends. According to the plot below, the average rate of low occupancy routes is 40.7% on weekdays (Monday to Friday), while 43.8% on weekends (Saturday and Sunday).

trains %>%
  subset(!is.na(occupancy)) %>%
  mutate(date = as.Date(datetime)) %>%  
  group_by(dotw, occupancy) %>%
  count() %>%
  ggplot(aes(x = dotw, y = n, fill=occupancy)) +  # Use color for lines
  geom_bar(stat = "identity") + 
   geom_text(
    aes(label = paste0(n)),
    position = position_stack(vjust = 0.5)
  ) +
  scale_fill_manual(values = colorPallete) +  
  labs(
    title = "Train Occupancy Trend by Day of Week",
    subtitle = "Training data, July - Oct 2016",
    x = "Day",
    y = "Count of Trains",
    color = "Occupancy Level"
  ) +
  theme_minimal()+plotTheme+
  theme(legend.position = "bottom")

sum_trains2 <- trains %>%
  subset(!is.na(occupancy)) %>%
  group_by(dotw, occupancy) %>%
  count() %>%
  pivot_wider(names_from = occupancy, values_from = n, values_fill = 0) %>%  
  mutate(
    ratio = low / (low + high),  
    total = low + high           
  )
trains %>%
  subset(!is.na(occupancy)) %>%
  mutate(date = as.Date(datetime)) %>%  
  group_by(dotw, occupancy) %>%
  summarise(n = n(), .groups = "drop") %>%  # Use summarise instead of count
  ggplot(aes(x = dotw, y = n, color = occupancy, group = occupancy)) +
  geom_line(size = 1) +
  scale_color_manual(values = colorPallete) +  
  labs(
    title = "Train Occupancy Trend by Day of Week",
            subtitle = "Training data, July - Oct 2016",
    x = "Day",
    y = "Count of Trains",
    color = "Occupancy Level"
  ) +
  theme_minimal()+plotTheme+
  theme(legend.position = "bottom")

(3) Train Occupancy Trend by Date

When comparing the trains data from mid-August to mid-September with the data from mid-September to the end of October, we observed that the number of record is significantly different. One possible reason we guess is missing data. Another reason could be that SNCB holiday period from the end of July to end of August, during which certain IC, S and P trains do not run, so that there is possibly a reduced number of trains during this period. In this analysis, we assumed that there were no missing data, and set event as 1 for the date of holiday period as well as weekends and public holidays.

SNCB Calendar of 2024
SNCB Calendar of 2024
trains %>%
  subset(!is.na(occupancy)) %>%
  mutate(date = as.Date(datetime)) %>%  
  group_by(date, hour = hour(datetime), occupancy) %>%
  count() %>%
  ggplot(aes(x = date, y = n, fill=occupancy)) +  # Use color for lines
  geom_bar(stat = "identity") +  # Stacked bar chart
  scale_fill_manual(values = colorPallete) +  
  labs(
    title = "Train Occupancy Trend by Date",
    subtitle = "Training data, July - Oct 2016",
    x = "Date and Hour",
    y = "Count of Trains",
    color = "Occupancy Level"
  ) +
  theme_minimal()+plotTheme+
  theme(legend.position = "bottom")

sum_trains3 <- trains %>%
  subset(!is.na(occupancy)) %>%
  group_by(date, dotw, occupancy) %>%
  count() %>%
  pivot_wider(names_from = occupancy, values_from = n, values_fill = 0) %>%  
  mutate(
    ratio = low / (low + high),  
    total = low + high           
  ) %>%
  filter(total >=5)
trains %>%
  subset(!is.na(occupancy)) %>%
  group_by(date = date(datetime)) %>%
  count(occupancy) %>%
  ggplot(aes(x = date, y = n, color = occupancy)) +
  geom_line() +
  scale_color_manual(values = colorPallete) +
  labs(
    title = "Train Occupancy Trend by Date",
        subtitle = "Training data, July - Oct 2016",
    x = "Date",
    y = "Count"
  ) +
  theme_minimal()+plotTheme+
  theme(legend.position = "bottom")

In sum, most prominent low occupancy routes are during off-peak hours, on weekends, and scattered throughout the date on early July-August. Meanwhile, high occupancy routes are strong during commuter hour, on weekdays, and on mid-late October.

Distribution of Train Occupancy Levels by OD Pairs

(1) Top 25 Count of Trains OD Pairs

# Filter high occupancy
high_occupancy <- trains %>% subset(!is.na(occupancy)) %>%
  filter(occupancy == "high") %>%
  count(from_station, to_station) %>%
  arrange(desc(n)) %>%
  rename(n_high = n)

low_occupancy <- trains %>% subset(!is.na(occupancy)) %>%
  filter(occupancy == "low") %>%
  count(from_station, to_station) %>%
  arrange(desc(n))%>%
  rename(n_low = n)

# Create and left join the frequency (count) of high and low occupancy data of origin and destination station by each OD pair
trains <- trains %>%
  left_join(high_occupancy, by = c("from_station" = "from_station", "to_station" = "to_station")) 
trains <- trains %>%
  left_join(low_occupancy, by = c("from_station" = "from_station", "to_station" = "to_station")) 

occupancy <- high_occupancy %>%
  left_join(low_occupancy, by = c("from_station" = "from_station", "to_station" = "to_station")) 

occupancy_data <- trains %>%
  subset(!is.na(occupancy)) %>%
  count(from_station, to_station, occupancy) %>%
  group_by(from_station, to_station) %>%
  mutate(total = sum(n)) 

occupancy_data %>%
  filter(!is.na(to_station)) %>%
   ungroup() %>% 
  arrange(desc(total)) %>%
  slice_head(n = 70) %>%  # Top 70 OD pairs
  ggplot(aes(
    x = reorder(paste(from_station, to_station, sep = " → "), total),
    y = n,
    fill = occupancy
  )) +
  geom_bar(stat = "identity") +  # Stacked bar chart
  coord_flip() +
  scale_fill_manual(values = colorPallete) +  
  labs(
    title = "Top 25 Count of Trains OD Pairs",
    subtitle = "Training data, July - Oct 2016",
    x = "OD Pair",
    y = "Count of Trains",
    fill = "Occupancy Level"
  ) +
  theme_minimal()+plotTheme+
  theme(legend.position = "bottom")

(2) Top 10 High Occupancy Count of Trains OD Pairs

# Plot top 10 OD pairs with high occupancy
high_occupancy %>%
  filter(!is.na(to_station)) %>%
  head(10) %>%
  ggplot(aes(x = reorder(paste(from_station, to_station, sep = " → "), n_high), y = n_high)) +
  geom_bar(stat = "identity", fill = "#a1b7b8") +
  coord_flip() +
  labs(
    title = "Top 10 High-Occupancy OD Pairs",
    subtitle = "Training data, July - Oct 2016",
    x = "OD Pair",
    y = "Count"
  ) +
  theme_minimal()+plotTheme

(3) Top 10 Low Occupancy Count of Trains OD Pairs

# Plot top 10 OD pairs with low occupancy
low_occupancy %>%
  filter(!is.na(to_station)) %>%
  head(10) %>%
  ggplot(aes(x = reorder(paste(from_station, to_station, sep = " → "), n_low), y = n_low)) +
  geom_bar(stat = "identity", fill = "#df543b") +
  coord_flip() +
  labs(
    title = "Top 10 Low-Occupancy OD Pairs",
    subtitle = "Training data, July - Oct 2016",
    x = "OD Pair",
    y = "Count"
  ) +
  theme_minimal()+plotTheme

(4) Top 25 Low Occupancy Percentage OD Pairs

# Plot top 25 OD pairs with low occupancy percentage
high_occupancy %>%
  left_join(low_occupancy, by = c("from_station" = "from_station", "to_station" = "to_station")) %>%
  filter(!is.na(to_station)) %>%
  mutate(perc = n_low/(n_high+n_low)*100) %>%
  arrange(desc(perc))%>%
  head(25) %>%
  ggplot(aes(x = reorder(paste(from_station, to_station, sep = " → "), perc), y = perc)) +
  geom_bar(stat = "identity", fill = "#df543b") +
  coord_flip() +
  labs(
    title = "Top 25 Low-Occupancy Percentage OD Pairs",
    subtitle = "Training data, July - Oct 2016",
    x = "OD Pair",
    y = "Percentage of Low Occupancy (%)"
  ) +
  theme_minimal()+plotTheme

# Distribution summary
#summary(high_occupancy$n)
#summary(low_occupancy$n)
#  group_by(from_station, to_station) %>%
#  summarise(n = sum(n)) %>%
#  ungroup()

# Visualize distribution 

ggplot(low_occupancy, aes(x = n_low)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "white") +
  labs(
    title = "Distribution of Low Occupancy Counts",
    x = "Count (n)",
    y = "Frequency"
  ) +
  theme_minimal()+plotTheme

Other Feature Associations with the Likelihood of Occupancy Level

(1) Continuous Variables: Average Stop Times, Number of Stops

trains %>%
  filter(!is.na(occupancy)) %>% 
  select(occupancy, from_avg_stop_times, to_avg_stop_times, nr_of_stops) %>%
  gather(Variable, value, -occupancy) %>%
  ggplot(aes(occupancy, value, fill=occupancy)) +
  geom_bar(position="dodge", stat="summary", fun="mean") +
  facet_wrap(~Variable, scales="free") +
  scale_fill_manual(values=colorPallete) +
  labs(x="Occupancy", y="Mean", 
       title="Feature associations with the likelihood of occupancy Level",
       subtitle="Average stop times of origin and destination station and the number of stops of each line\nTraining data, July - Oct 2016", 
       caption="Source: Kaggle")+
  theme_minimal()+plotTheme+
  theme(legend.position="none")
## Warning: Removed 159 rows containing non-finite outside the scale range
## (`stat_summary()`).

(2) Continuous Variables: Temperature, Relative Humidity, Wind Speed of Origin and Destination Station

trains %>%
  filter(!is.na(occupancy)) %>% 
  select(occupancy, from_wind, to_wind, from_temp, to_temp, from_humid, to_humid) %>%
  gather(Variable, value, -occupancy) %>%
  ggplot(aes(occupancy, value, fill=occupancy)) +
  geom_bar(position="dodge", stat="summary", fun="mean") +
  facet_wrap(~Variable, scales="free") +
  scale_fill_manual(values=colorPallete) +
  labs(x="Occupancy", y="Mean", 
       title="Feature associations with the likelihood of occupancy Level",
       subtitle="Temperature and wind of origin and destination station\nTraining data, July - Oct 2016", 
       caption="Source: Kaggle")+
  theme_minimal()+plotTheme+
  theme(legend.position="none")
## Warning: Removed 568 rows containing non-finite outside the scale range
## (`stat_summary()`).

(3) Continuous Variables: Distance from Origin to Destination, Origin to Brussels, Destination to Brussels

trains %>%
  filter(!is.na(occupancy)) %>% 
  select(occupancy, dist_from_O_to_D, dist_from_O_to_Brussels, dist_from_D_to_Brussels) %>%
  gather(Variable, value, -occupancy) %>%
  ggplot(aes(occupancy, value, fill=occupancy)) +
  geom_bar(position="dodge", stat="summary", fun="mean") +
  facet_wrap(~Variable, scales="free") +
  scale_fill_manual(values=colorPallete) +
  labs(x="Occupancy", y="Mean distance (km)", 
       title="Feature associations with the likelihood of occupancy Level",
       subtitle="Distance from O, D to Brussels, and from O to D\nTraining data, July - Oct 2016", 
       caption="Source: Kaggle")+
  theme_minimal()+plotTheme+
  theme(legend.position="none")
## Warning: Removed 85 rows containing non-finite outside the scale range
## (`stat_summary()`).

(4) Categorical Variables: Bearing Angle Direction from Origin to Destination

trains %>%
  filter(!is.na(occupancy)) %>% 
  select(occupancy, bearing_from_O_to_D_cat) %>%
  gather(Variable, value, -occupancy) %>%
  count(Variable, value, occupancy) %>%
  ggplot(aes(occupancy, n, fill=occupancy)) +
  geom_bar(position="dodge", stat="summary") +
  ylim(0,250)+
  facet_wrap(~value, scales="free") +
  scale_fill_manual(values=colorPallete) +
  labs(x="Occupancy", y="Count", 
       title="Feature associations with the likelihood of low occupancy",
       subtitle="Vehicle type for IC, L, P, S, and THA\nTraining data, July - Oct 2016", 
       caption="Source: Kaggle")+
  theme_minimal()+plotTheme+
  theme(legend.position="none")
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`

(5) Binomial Variables: Event

trains %>%
  filter(!is.na(occupancy)) %>% 
  select(occupancy, event) %>%
  gather(Variable, value, -occupancy) %>%
  count(Variable, value, occupancy) %>%
  ggplot(aes(occupancy, n, fill=occupancy)) +
  geom_bar(position="dodge", stat="summary") +
  facet_wrap(~value, scales="free") +
  scale_fill_manual(values=colorPallete) +
  ylim(0,1200)+
  labs(x="Occupancy", y="Count", 
       title="Feature associations with the likelihood of low occupancy",
       subtitle="Event: 0 = Weekday,  1 = Weekend or holiday\nTraining data, July - Oct 2016", 
       caption="Source: Kaggle")+
  theme_minimal()+plotTheme+
  theme(legend.position="none")
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`

### (6) Categorical Variables: Time category

trains %>%
  filter(!is.na(occupancy)) %>% 
  select(occupancy, time_cat) %>%
  gather(Variable, value, -occupancy) %>%
  count(Variable, value, occupancy) %>%
  ggplot(aes(occupancy, n, fill=occupancy)) +
  geom_bar(position="dodge", stat="summary") +
  ylim(0,650)+
  facet_wrap(~value, scales="free") +
  scale_fill_manual(values=colorPallete) +
  labs(x="Occupancy", y="Count", 
       title="Feature associations with the likelihood of low occupancy",
       subtitle="0700-0900 and 1700-1900 commute, 1900-2400 evening, 0000-0500 midnight, 0500-0700 morning, 0900-1700 noon\nTraining data, July - Oct 2016", 
       caption="Source: Kaggle")+
  theme_minimal()+plotTheme+
  theme(legend.position="none")
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_summary()`).
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`

train_summary <- trains %>%
  filter(!is.na(occupancy)) %>% 
  group_by(dotw, time_cat) %>%
  summarize(
    total_count = n(),  
    low_count = sum(occupancy == "low", na.rm = TRUE),  
    low_ratio = low_count / total_count*100  
  ) %>%
  ungroup()
## `summarise()` has grouped output by 'dotw'. You can override using the
## `.groups` argument.
ggplot(train_summary, aes(x = dotw, y = low_ratio, fill=time_cat)) +
  geom_bar(stat = "identity", position = "dodge") +  
  scale_fill_manual(
    values = c(
      "morning" = "#1f78b4",      
      "commute time" = "#33a02c", 
      "noon" = "#e31a1c",         
      "evening" = "#ff7f00",      
      "midnight" = "#6a3d9a"      
    )
  ) +
  labs(
    title="Feature associations with the likelihood of occupancy Level",
    subtitle="Low occupancy rate by time category of each day\nTraining data, July - Oct 2016", 
    x = "Day of Week",
    y = "Low occupancy rate (%)", 
    caption="Source: Kaggle"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")+plotTheme

(7) Categorical VariableS: Vehicle Type

trains %>%
  filter(!is.na(occupancy)) %>% 
  select(occupancy, vehicle_type) %>%
  gather(Variable, value, -occupancy) %>%
  count(Variable, value, occupancy) %>%
  ggplot(aes(occupancy, n, fill=occupancy)) +
  geom_bar(position="dodge", stat="summary") +
  ylim(0,1000)+
  facet_wrap(~value, scales="free") +
  scale_fill_manual(values=colorPallete) +
  labs(x="Occupancy", y="Count", 
       title="Feature associations with the likelihood of low occupancy",
       subtitle="Vehicle type for IC, L, P, S, and THA\nTraining data, July - Oct 2016", 
       caption="Source: Kaggle")+
  theme_minimal()+plotTheme+
  theme(legend.position="none")
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`

threshold <- 3 # Moderate threshold capturing higher-demand OD pairs

# Filter high-occupancy OD pairs
high_routes <- high_occupancy %>%
  filter(n_high >= threshold) %>%
  left_join(stations, by = c("from_station" = "name")) %>%
  rename(from_lon = longitude, from_lat = latitude) %>%
  left_join(stations, by = c("to_station" = "name")) %>%
  rename(to_lon = longitude, to_lat = latitude)

low_routes <- low_occupancy %>%
  filter(n_low >= threshold) %>%
  left_join(stations, by = c("from_station" = "name")) %>%
  rename(from_lon = longitude, from_lat = latitude) %>%
  left_join(stations, by = c("to_station" = "name")) %>%
  rename(to_lon = longitude, to_lat = latitude)

# Create line data for leaflet polylines
high_routes <- high_routes %>%
  mutate(route_label = paste(from_station, "->", to_station))

low_routes <- low_routes %>%
  mutate(route_label = paste(from_station, "->", to_station))

# Step 4: Create line data for leaflet polylines
routes_lines <- high_routes %>%
  rowwise() %>%
  do({
    data.frame(
      lng = c(.$from_lon, .$to_lon),
      lat = c(.$from_lat, .$to_lat),
      route_label = .$route_label
    )
  })

routes_lines_l <- low_routes %>%
  rowwise() %>%
  do({
    data.frame(
      lng = c(.$from_lon, .$to_lon),
      lat = c(.$from_lat, .$to_lat),
      route_label = .$route_label
    )
  })

# Create a color palette for the zones
zone_colors <- colorFactor(
  palette = "Dark2",
  domain = stations_sf$Zone # Map unique zone values
)
# Create Leaflet map
leaflet() %>%
  addTiles() %>%
  # Add station markers
  addCircleMarkers(
    data = stations_sf,
    label = ~name,
    radius = 3,
    color = ~zone_colors(Zone),
    fillOpacity = 0.1
  ) %>%
  # Add OD pair lines

  addPolylines(
    data = routes_lines,
    lng = ~lng,
    lat = ~lat,
    color = "#a1b7b8",
    opacity = 0.3,
    weight = 2,
    label = ~route_label
  ) %>%
    addPolylines(
    data = routes_lines_l,
    lng = ~lng,
    lat = ~lat,
    color = "#df543b",
    opacity = 0.1,
    weight = 2,
    label = ~route_label
  ) %>%
  
  
  # Add legend
  addLegend(
    "bottomright",
    colors = c("black", "#a1b7b8",  "#df543b"),
    labels = c("Stations", "High Occupancy Routes", "Low Occupancy Routes"),
    title = "Routes Map"
  )

4. Test of Logistic Regression Model (Original Model)

trains <- trains %>%
  mutate(date_numeric = as.numeric(as.Date(date))) %>%
  mutate(time_numeric = as.numeric(as_hms(time))) %>%
  mutate(occupancy_numeric = case_when(
    occupancy == "low" ~ 1,
    occupancy == "high" ~ 0,
    TRUE ~ NA_real_
  ))

# Split trains into trains_train and trains_test  
trains_train <- trains %>% filter(!is.na(occupancy)) 
trains_test <- trains %>% filter(is.na(occupancy))

# Split trains_train into trains_trainTrain and trains_trainTest
set.seed(3456)
trainIndex <- createDataPartition(trains_train$occupancy, p = .65, 
                                  list = FALSE,
                                  times = 1)
trains_trainTrain <- trains_train[ trainIndex,]
trains_trainTest  <- trains_train[-trainIndex,]

reg <- glm(occupancy_numeric ~ ., data = 
                    trains_trainTrain %>% dplyr::select(date_numeric, time_numeric, dotw, occupancy_numeric),
                family = "binomial"(link = "logit"))

a<-summary(reg) # / AIC: 1,987
pseudo_r2 <- pR2(reg)
## fitting null model for pseudo-r2
pseudo_r2_df <- as.data.frame(t(pseudo_r2))
kable(pseudo_r2_df, "html", caption = "Pseudo-R² Values for `train logistic model`") %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
Pseudo-R² Values for train logistic model
llh llhNull G2 McFadden r2ML r2CU
-984.2853 -1008.389 48.20693 0.0239029 0.0318569 0.042939
testProbs <- data.frame(Outcome = as.numeric(trains_trainTest$occupancy_numeric),
                        Probs = predict(reg, trains_trainTest, type= "response"))
#testProbs$Probs_p <- case_when(
#  testProbs$Probs >= 0.5 ~ 1,
#  testProbs$Probs < 0.5 ~ 0,
#  TRUE ~ NA
#)
ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  xlim(0,1)+
  scale_fill_manual(values = c("#a1b7b8", "#df543b")) +
  labs(x = "Prediction of Occupancy", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")+plotTheme

ggplot() +
  geom_roc(data= testProbs, aes(d = as.numeric(Outcome), m = Probs), n.cuts = 50, labels = FALSE) +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve ",
       subtitle = "trains_trainTest that is split of trains_train") +
  theme(legend.position = "bottom")+plotTheme

pROC::auc(testProbs$Outcome, testProbs$Probs) # AUC = 0.6277
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.6277

Cost Benefit Analysis

5. Test of Logistic Regression Model (Improved Model)

reg <- glm(occupancy_numeric ~ ., data = 
                    trains_trainTrain %>% dplyr::select(date_numeric, time_numeric, dotw, occupancy_numeric, 
                                                        
     from_avg_stop_times, from_pop,  from_avg_stop_times, to_avg_stop_times, to_pop, from_temp, from_wind, from_humid, to_temp, to_wind, to_humid, event, nr_of_stops, dist_from_O_to_D, dist_from_O_to_Brussels, dist_from_D_to_Brussels, time_cat, vehicle_type, bearing_from_O_to_D_cat),
                family = "binomial"(link = "logit"))

b<-summary(reg) # / AIC: 1,651
pseudo_r2 <- pR2(reg)
## fitting null model for pseudo-r2
pseudo_r2_df <- as.data.frame(t(pseudo_r2))
kable(pseudo_r2_df, "html", caption = "Pseudo-R² Values for `train logistic model`") %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
Pseudo-R² Values for train logistic model
llh llhNull G2 McFadden r2ML r2CU
-788.71 -883.1156 188.8112 0.1069006 0.1347994 0.1816908
testProbs <- data.frame(Outcome = (trains_trainTest$occupancy_numeric),
                        Probs = predict(reg, trains_trainTest, type= "response"))


ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  xlim(0,1)+
  scale_fill_manual(values = c("#a1b7b8", "#df543b")) +
  labs(x = "Prediction of Occupancy", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")+plotTheme

testProbs <- testProbs %>%
  mutate(predOutcome = (ifelse(Probs >= 0.45,1,0))) # Threshold = 0.45
testProbs %>%
  filter( !is.na(predOutcome)) %>% 
  summarize(observedLow=sum(Outcome), predictedLow=sum(predOutcome)) %>%

  gather(Variable, Value) %>%
  ggplot(aes(x = Variable, y = Value))+
  geom_bar(position="dodge", stat="identity")

ggplot() +
  geom_roc(data= testProbs, aes(d = as.numeric(Outcome), m = Probs), n.cuts = 50, labels = FALSE) +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve ",
       subtitle = "trains_trainTest that is split of trains_train") +
  theme(legend.position = "bottom")+plotTheme

pROC::auc(testProbs$Outcome, testProbs$Probs) # AUC = 0.6764
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.6727

Cost Benefit Analysis

6. Prediction

7. Conclusion and Discussion

Conclusion

Discussion